[REVIEW] add option for per-thread default stream #354
Conversation
Can one of the admins verify this patch?
Is this going to work? cnmem is playing games with grouping allocations per stream and only synchronizing streams when something moves between the pools. With per-thread default stream, will cnmem still be able to maintain consistency when it thinks everything is on stream 0, even though it is not?
CNMeM recognizes this option: https://github.com/NVIDIA/cnmem/blob/master/src/cnmem.cpp#L386
Wow, I never noticed that... Interesting. But do you really want to synchronize on EVERY allocation?
The current CNMeM implementation is definitely not optimized for PTDS. Once we have it enabled, we can probably try to improve it. Also, the current algorithm may not be the best at reducing fragmentation. We might want to take a page from jemalloc (http://jemalloc.net/).
I think we'd prefer not to put energy into cnmem. Plans are to keep improving RMM's device_memory_resource classes and remove cnmem. E.g. pool_memory_resource is already better than cnmem (except it doesn't have the hack to enable PTDS). jemalloc has a lot of pages. Which one are you referring to?
I think we can make PTDS work if we use events and event synchronization instead of stream synchronization in the allocator. The issue we are trying to protect against is memory being freed while it is still in use on one stream, and then being handed out and used on another stream before that work has completed.
CUDA protects against this when a free happens by essentially synchronizing on the device before the memory is allocated again. If we know that memory was intended to be used on stream A when we free it, we can insert an event into stream A and then not hand that memory to another stream unless the event has completed. The hack for PTDS would be to treat all allocations as being on different streams, because we just don't know. This opens things up with Mark's pooling allocator to walk through the free list and look for a block whose event has completed by polling instead of blocking, and then only block if none are free. This should make the common case very fast, even with PTDS.
Yep, this is what @harrism and I have talked about. You create and enqueue an event every time a block of memory is freed and associate it with that block. Then you need to wait on that event before reclaiming it. There were some additional warts: any time you want to coalesce a block with other blocks, you need to wait on their events, which introduces more synchronization and can cause freeing to no longer be asynchronous. It's certainly possible, we just haven't done it yet.
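To make the idea above concrete, here is a minimal sketch of the event-per-freed-block scheme written against the plain CUDA runtime API; the `free_block` struct and function names are invented for illustration and are not RMM's actual types.

```cpp
// Sketch: record an event when a block is freed, and only reuse the block once
// that event has completed. cudaEventQuery lets the allocator poll instead of
// blocking; it only falls back to cudaEventSynchronize when no block's event
// has completed yet. Types and names here are illustrative, not RMM's.
#include <cuda_runtime.h>
#include <cstddef>
#include <list>

struct free_block {
  void* ptr;
  std::size_t size;
  cudaEvent_t ready;  // completes when the freeing stream's prior work is done
};

std::list<free_block> free_list;

void deallocate(void* ptr, std::size_t size, cudaStream_t stream) {
  cudaEvent_t ev;
  cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
  cudaEventRecord(ev, stream);  // marks the point after which reuse is safe
  free_list.push_back({ptr, size, ev});
}

void* try_allocate(std::size_t size) {
  // First pass: poll, and take any block whose event has already completed.
  for (auto it = free_list.begin(); it != free_list.end(); ++it) {
    if (it->size >= size && cudaEventQuery(it->ready) == cudaSuccess) {
      void* p = it->ptr;
      cudaEventDestroy(it->ready);
      free_list.erase(it);
      return p;
    }
  }
  // Otherwise block on the first suitable candidate.
  for (auto it = free_list.begin(); it != free_list.end(); ++it) {
    if (it->size >= size) {
      cudaEventSynchronize(it->ready);
      void* p = it->ptr;
      cudaEventDestroy(it->ready);
      free_list.erase(it);
      return p;
    }
  }
  return nullptr;  // caller would fall back to growing the pool
}
```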
Just a drive-by question, perhaps @jakirkham has given this some thought: what happens if rmm is PTDS-enabled but other libraries aren't? This could happen in the Python world when, say, coupling rmm to CuPy.
We haven't crossed that road quite yet, but I imagine it will cause issues 😅
I'm not sure how expensive it is to create and destroy events. Maybe we need an event pool. :)
Creating and enqueuing events is "free". Waiting on them is not, since it's a synchronization.
Thanks Leo! Yeah, generally aware this is happening, but I don't think we are planning on using this for Python yet (as Keith said).
@harrism added the option to tests and benchmarks. Please take another look. This PR is pretty innocuous. Do you think we can get it merged before attempting more sophisticated PTDS support? As for jemalloc, a couple of things we can borrow are per-thread arenas that allocate small blocks so they don't have to lock, and perhaps first-fit for larger blocks.
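As a rough illustration of the per-thread-arena idea (the names, sizes, and layout here are hypothetical, not jemalloc's or RMM's actual design), each thread could own a slab carved from a shared pool so that small allocations never take the global lock:

```cpp
// Hypothetical per-thread arena for small device allocations. Small requests
// bump-allocate from a thread-local slab; the global lock is only taken when
// a thread needs a new slab. Freeing and large allocations are omitted, and
// requests are assumed to be no larger than one slab.
#include <cuda_runtime.h>
#include <cstddef>
#include <mutex>

class global_pool {
 public:
  // Hand out a large slab under a lock; called rarely (only on arena refill).
  void* grab_slab(std::size_t bytes) {
    std::lock_guard<std::mutex> lock(mtx_);
    void* p = nullptr;
    cudaMalloc(&p, bytes);  // stand-in for carving from a preallocated pool
    return p;
  }

 private:
  std::mutex mtx_;
};

class thread_arena {
 public:
  explicit thread_arena(global_pool& pool) : pool_(pool) {}

  // Small allocations bump-allocate from the thread-local slab: no lock needed.
  void* allocate_small(std::size_t bytes) {
    if (slab_ == nullptr || offset_ + bytes > slab_size_) {
      slab_   = static_cast<char*>(pool_.grab_slab(slab_size_));
      offset_ = 0;
    }
    void* p = slab_ + offset_;
    offset_ += bytes;
    return p;
  }

 private:
  global_pool& pool_;
  char* slab_            = nullptr;
  std::size_t slab_size_ = 1 << 20;  // 1 MiB slabs, arbitrary
  std::size_t offset_    = 0;
};

global_pool g_pool;
// One arena per thread, so threads don't contend on the global lock for small blocks.
thread_local thread_arena t_arena{g_pool};
```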
Does CNMeM have any global state? If not, would it be possible to just use a different CNMeM pool per thread?
Yes, if 1/n of the memory is large enough. Stealing memory from other threads becomes tricky though.
Does the pointer to a memory allocation remain the same if it crosses threads when PTDS is enabled? Or does PTDS affect how memory is addressed per thread?
1/n of the memory is not acceptable. First of all, there is data skew, and we don't want to hard-partition the memory like that. Java uses a huge number of threads. For Spark in particular, we schedule more threads than can fit on the GPU so that we can overlap I/O on the CPU with computation on the GPU. To support this scheme, it would be a massive undertaking to shoehorn Spark's threading model into the proposal.
Please add a changelog entry if you want CI to run. Otherwise we'll get nowhere. :)
Yes, PTDS does not create a separate address space for threads, rather just a separate, asynchronous CUDA stream per thread. Device address mappings remain the same for the entire process. However, exchanging device memory addresses between threads will require application-level event or stream synchronization to be safe, since separate threads will not be issuing to the same stream as they do today without PTDS.
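A sketch of what that application-level synchronization could look like when handing a device pointer from one thread to another under PTDS; the `handoff` struct and the promise/future plumbing are illustrative assumptions, not an RMM API.

```cpp
// Sketch: thread A produces data on its per-thread default stream, records an
// event, and hands the pointer plus event to thread B. Thread B makes its own
// per-thread default stream wait on that event before touching the memory.
// The handoff mechanism (struct + promise/future) is invented for illustration.
#include <cuda_runtime.h>
#include <future>

struct handoff {
  void* ptr;
  cudaEvent_t ready;
};

void producer(std::promise<handoff> out, void* d_ptr) {
  // ... launch kernels writing d_ptr on this thread's default stream ...
  cudaEvent_t ev;
  cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
  cudaEventRecord(ev, cudaStreamPerThread);  // after the producing work
  out.set_value({d_ptr, ev});
}

void consumer(std::future<handoff> in) {
  handoff h = in.get();
  // Order this thread's default stream after the producer's work.
  cudaStreamWaitEvent(cudaStreamPerThread, h.ready, 0);
  // ... launch kernels reading h.ptr on this thread's default stream ...
  cudaEventDestroy(h.ready);
}
```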
@harrism I added this PR to the changelog, but looks like the CI is still not running. Do you need to whitelist it?
add to whitelist
rerun tests
Co-Authored-By: Keith Kraus <[email protected]>
add to whitelist
This change adds the option to build RMM with `--default-stream per-thread` enabled. Since RMM doesn't use `nvcc` for non-test code, this is actually done by passing `-DCUDA_API_PER_THREAD_DEFAULT_STREAM` to gcc. This is the alternative solution to #352.
By default the option is disabled. To enable it:
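The exact build invocation isn't preserved in this excerpt; assuming a CMake option named `PER_THREAD_DEFAULT_STREAM` (hypothetical spelling; check RMM's CMakeLists.txt for the real name), enabling it would look something like:

```sh
# Hypothetical option name; the actual flag is defined in RMM's CMakeLists.txt
cmake -DPER_THREAD_DEFAULT_STREAM=ON ..
make -j
```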
I've tested this manually on some Spark jobs, which use the CNMeM memory resource. CNMeM should work based on this line (https://github.com/NVIDIA/cnmem/blob/master/src/cnmem.cpp#L386).
The corresponding cuDF change is rapidsai/cudf#4995.
@harrism @jrhemstad @revans2 @jlowe